
[Feature] Optimizations for JPEG input on NVIDIA GPU#19749

Merged
yhyang201 merged 2 commits into sgl-project:main from wili-65535:wili/jpeg-preprocess
Mar 29, 2026

Conversation

@wili-65535
Contributor

@wili-65535 wili-65535 commented Mar 3, 2026

Motivation

Modifications

  • Use torch.ops.image.decode_jpegs_cuda, converting CPU bytes directly to torch GPU tensors using the nvJPEG hardware decoder.
  • This eliminates intermediate data formats such as PIL Images and CPU tensors, and minimizes CPU-GPU data transfers.
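The routing this introduces can be sketched as follows. The function and flag names below are illustrative, not the PR's actual identifiers: JPEG payloads are detected by their magic bytes and sent down the GPU path, while everything else keeps the PIL fallback.

```python
JPEG_MAGIC = b"\xff\xd8\xff"  # SOI marker + first marker byte of every JPEG stream


def is_jpeg(data: bytes) -> bool:
    """Cheap content sniff: every JPEG file starts with FF D8 FF."""
    return data[:3] == JPEG_MAGIC


def choose_decode_path(data: bytes, cuda_available: bool) -> str:
    """Illustrative routing: nvJPEG for JPEG bytes on CUDA, PIL otherwise."""
    if is_jpeg(data) and cuda_available:
        # Real code would call the nvJPEG-backed op here, e.g.
        # torch.ops.image.decode_jpegs_cuda(...), producing a GPU tensor
        # and skipping the PIL Image / CPU tensor intermediates.
        return "nvjpeg_gpu"
    # Fallback: decode on the CPU via PIL (PNG, WebP, GIF, or no GPU).
    return "pil_cpu"
```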

Accuracy Tests

  • lm_eval shows no accuracy drop between the main branch and this PR; both get similar scores:

  • Command:

lm-eval run --model sglang --model_args pretrained=/workspace/Qwen3-VL-8B-Instruct,dtype=auto,tp_size=1 --tasks gpqa_diamond_zeroshot gpqa_extended_zeroshot gpqa_main_zeroshot gpqa_diamond_n_shot gpqa_extended_n_shot gpqa_main_n_shot
  • Before this optimization (main branch baseline)
| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| gpqa_diamond_n_shot | 2 | none | 0 | acc | 0.3687 | ± 0.0344 |
| | | none | 0 | acc_norm | 0.3687 | ± 0.0344 |
| gpqa_diamond_zeroshot | 1 | none | 0 | acc | 0.3788 | ± 0.0346 |
| | | none | 0 | acc_norm | 0.3788 | ± 0.0346 |
| gpqa_extended_n_shot | 2 | none | 0 | acc | 0.3791 | ± 0.0208 |
| | | none | 0 | acc_norm | 0.3791 | ± 0.0208 |
| gpqa_extended_zeroshot | 1 | none | 0 | acc | 0.3773 | ± 0.0208 |
| | | none | 0 | acc_norm | 0.3773 | ± 0.0208 |
| gpqa_main_n_shot | 2 | none | 0 | acc | 0.3862 | ± 0.0230 |
| | | none | 0 | acc_norm | 0.3862 | ± 0.0230 |
| gpqa_main_zeroshot | 1 | none | 0 | acc | 0.4085 | ± 0.0232 |
| | | none | 0 | acc_norm | 0.4085 | ± 0.0232 |
  • After this optimization (this PR)
| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| gpqa_diamond_n_shot | 2 | none | 0 | acc | 0.3687 | ± 0.0344 |
| | | none | 0 | acc_norm | 0.3687 | ± 0.0344 |
| gpqa_diamond_zeroshot | 1 | none | 0 | acc | 0.3788 | ± 0.0346 |
| | | none | 0 | acc_norm | 0.3788 | ± 0.0346 |
| gpqa_extended_n_shot | 2 | none | 0 | acc | 0.3790 | ± 0.0185 |
| | | none | 0 | acc_norm | 0.3790 | ± 0.0185 |
| gpqa_extended_zeroshot | 1 | none | 0 | acc | 0.3773 | ± 0.0208 |
| | | none | 0 | acc_norm | 0.3773 | ± 0.0208 |
| gpqa_main_n_shot | 2 | none | 0 | acc | 0.3862 | ± 0.0230 |
| | | none | 0 | acc_norm | 0.3862 | ± 0.0230 |
| gpqa_main_zeroshot | 1 | none | 0 | acc | 0.4085 | ± 0.0232 |
| | | none | 0 | acc_norm | 0.4085 | ± 0.0232 |
  • lmms_eval shows no accuracy drop between the main branch and this PR; both get similar scores:

  • Command:

python3 -m lmms_eval --model sglang --model_args model_path=/workspace/Qwen3-VL-8B-Instruct --tasks mmmu_val
  • Before this optimization (main branch baseline)
{'Overall-Art and Design': {'num': 120, 'acc': 0.68333}, 'Art': {'num': 30, 'acc': 0.66667}, 'Art_Theory': {'num': 30, 'acc': 0.9}, 'Design': {'num': 30, 'acc': 0.73333}, 'Music': {'num': 30, 'acc': 0.43333}, 'Overall-Business': {'num': 150, 'acc': 0.40667}, 'Accounting': {'num': 30, 'acc': 0.3}, 'Economics': {'num': 30, 'acc': 0.5}, 'Finance': {'num': 30, 'acc': 0.26667}, 'Manage': {'num': 30, 'acc': 0.5}, 'Marketing': {'num': 30, 'acc': 0.46667}, 'Overall-Science': {'num': 150, 'acc': 0.45333}, 'Biology': {'num': 30, 'acc': 0.53333}, 'Chemistry': {'num': 30, 'acc': 0.36667}, 'Geography': {'num': 30, 'acc': 0.53333}, 'Math': {'num': 30, 'acc': 0.3}, 'Physics': {'num': 30, 'acc': 0.53333}, 'Overall-Health and Medicine': {'num': 150, 'acc': 0.53333}, 'Basic_Medical_Science': {'num': 30, 'acc': 0.66667}, 'Clinical_Medicine': {'num': 30, 'acc': 0.66667}, 'Diagnostics_and_Laboratory_Medicine': {'num': 30, 'acc': 0.36667}, 'Pharmacy': {'num': 30, 'acc': 0.5}, 'Public_Health': {'num': 30, 'acc': 0.46667}, 'Overall-Humanities and Social Science': {'num': 120, 'acc': 0.675}, 'History': {'num': 30, 'acc': 0.6}, 'Literature': {'num': 30, 'acc': 0.83333}, 'Sociology': {'num': 30, 'acc': 0.63333}, 'Psychology': {'num': 30, 'acc': 0.63333}, 'Overall-Tech and Engineering': {'num': 210, 'acc': 0.38095}, 'Agriculture': {'num': 30, 'acc': 0.53333}, 'Architecture_and_Engineering': {'num': 30, 'acc': 0.3}, 'Computer_Science': {'num': 30, 'acc': 0.5}, 'Electronics': {'num': 30, 'acc': 0.3}, 'Energy_and_Power': {'num': 30, 'acc': 0.3}, 'Materials': {'num': 30, 'acc': 0.4}, 'Mechanical_Engineering': {'num': 30, 'acc': 0.33333}, 'Overall': {'num': 900, 'acc': 0.50222}}
2026-03-16T05:00:15.655750+0000 | save_results_aggregated | INFO - Output path not provided, skipping saving results aggregated
sglang (model_path=/workspace/Qwen3-VL-8B-Instruct), gen_kwargs: (), limit: None, offset: 0, num_fewshot: None, batch_size: 1

LMMs-Eval: Probing Intelligence in the Real World
> The unified evaluation toolkit for frontier models.

branch: main
commit: v0.6-72-g88b23e2b

| Tasks  |Filter|n-shot| Metric |   |Value |   |Stderr|
|--------|------|-----:|--------|---|-----:|---|------|
|mmmu_val|none  |     0|mmmu_acc|↑  |0.5022|±  |N/A   |
  • After this optimization (this PR)
{'Overall-Art and Design': {'num': 120, 'acc': 0.68333}, 'Art': {'num': 30, 'acc': 0.66667}, 'Art_Theory': {'num': 30, 'acc': 0.9}, 'Design': {'num': 30, 'acc': 0.73333}, 'Music': {'num': 30, 'acc': 0.43333}, 'Overall-Business': {'num': 150, 'acc': 0.40667}, 'Accounting': {'num': 30, 'acc': 0.3}, 'Economics': {'num': 30, 'acc': 0.5}, 'Finance': {'num': 30, 'acc': 0.26667}, 'Manage': {'num': 30, 'acc': 0.5}, 'Marketing': {'num': 30, 'acc': 0.46667}, 'Overall-Science': {'num': 150, 'acc': 0.45333}, 'Biology': {'num': 30, 'acc': 0.53333}, 'Chemistry': {'num': 30, 'acc': 0.36667}, 'Geography': {'num': 30, 'acc': 0.53333}, 'Math': {'num': 30, 'acc': 0.3}, 'Physics': {'num': 30, 'acc': 0.53333}, 'Overall-Health and Medicine': {'num': 150, 'acc': 0.53333}, 'Basic_Medical_Science': {'num': 30, 'acc': 0.66667}, 'Clinical_Medicine': {'num': 30, 'acc': 0.66667}, 'Diagnostics_and_Laboratory_Medicine': {'num': 30, 'acc': 0.36667}, 'Pharmacy': {'num': 30, 'acc': 0.5}, 'Public_Health': {'num': 30, 'acc': 0.46667}, 'Overall-Humanities and Social Science': {'num': 120, 'acc': 0.675}, 'History': {'num': 30, 'acc': 0.6}, 'Literature': {'num': 30, 'acc': 0.83333}, 'Sociology': {'num': 30, 'acc': 0.63333}, 'Psychology': {'num': 30, 'acc': 0.63333}, 'Overall-Tech and Engineering': {'num': 210, 'acc': 0.38095}, 'Agriculture': {'num': 30, 'acc': 0.53333}, 'Architecture_and_Engineering': {'num': 30, 'acc': 0.3}, 'Computer_Science': {'num': 30, 'acc': 0.5}, 'Electronics': {'num': 30, 'acc': 0.3}, 'Energy_and_Power': {'num': 30, 'acc': 0.3}, 'Materials': {'num': 30, 'acc': 0.4}, 'Mechanical_Engineering': {'num': 30, 'acc': 0.33333}, 'Overall': {'num': 900, 'acc': 0.50222}}
2026-03-16T05:11:01.956747+0000 | save_results_aggregated | INFO - Output path not provided, skipping saving results aggregated
sglang (model_path=/workspace/Qwen3-VL-8B-Instruct), gen_kwargs: (), limit: None, offset: 0, num_fewshot: None, batch_size: 1

LMMs-Eval: Probing Intelligence in the Real World
> The unified evaluation toolkit for frontier models.

branch: main
commit: v0.6-72-g88b23e2b

| Tasks  |Filter|n-shot| Metric |   |Value |   |Stderr|
|--------|------|-----:|--------|---|-----:|---|------|
|mmmu_val|none  |     0|mmmu_acc|↑  |0.5022|±  |N/A   |

Benchmarking and Profiling

  • Part of the performance data is in the original issue; here we provide more detailed numbers.
  • We use an H100 GPU to run the Qwen3-VL-8B model (the model choice doesn't actually matter), send requests each carrying one JPEG image at various resolutions, and focus on log lines like:
[2026-03-03 06:53:03] [QwenVLProcessor Perf] rid='44a05fff418f4b1cb448b345fa8ac336', load_time: 13.69 ms, preprocess_time: 0.00 ms, process_time: 307.90 ms, get_rope_index_time: 3.58 ms, total_time: 325.17 ms
  • The results before / after this PR are shown below. On average a 1.5x speedup is obtained, and the larger the image, the bigger the gain. In some extreme scenarios (like the one in the original issue), up to 3.8x can be reached.
| size | load_time/ms (before) | process_time/ms (before) | total_time/ms (before) | load_time/ms (after) | process_time/ms (after) | total_time/ms (after) | Speed Up |
|---|---:|---:|---:|---:|---:|---:|---:|
| 32x32 | 0.58 | 1.18 | 2.16 | 0.62 | 1.18 | 2.15 | 1.00 |
| 64x64 | 0.34 | 1.55 | 2.16 | 0.57 | 1.86 | 2.91 | 0.74 |
| 96x96 | 0.52 | 2.32 | 3.12 | 0.57 | 1.25 | 2.09 | 1.49 |
| 128x128 | 0.37 | 2.40 | 3.05 | 0.56 | 1.25 | 2.08 | 1.47 |
| 160x160 | 0.52 | 1.97 | 2.76 | 0.60 | 1.24 | 2.12 | 1.30 |
| 192x192 | 0.35 | 1.66 | 2.27 | 0.63 | 1.29 | 2.20 | 1.03 |
| 224x224 | 0.37 | 1.94 | 2.57 | 0.69 | 1.28 | 2.26 | 1.14 |
| 256x256 | 0.48 | 2.02 | 2.77 | 0.68 | 1.24 | 2.19 | 1.26 |
| 288x288 | 0.43 | 1.96 | 2.64 | 0.74 | 1.12 | 2.14 | 1.23 |
| 320x320 | 0.45 | 2.22 | 2.93 | 0.77 | 1.19 | 2.26 | 1.30 |
| 352x352 | 0.57 | 2.32 | 3.15 | 0.82 | 1.25 | 2.34 | 1.35 |
| 384x384 | 0.52 | 2.65 | 3.43 | 0.86 | 1.33 | 2.47 | 1.39 |
| 416x416 | 0.54 | 2.84 | 3.63 | 0.88 | 1.43 | 2.58 | 1.41 |
| 448x448 | 0.56 | 3.09 | 3.91 | 0.97 | 1.57 | 2.84 | 1.38 |
| 480x480 | 0.66 | 3.60 | 4.51 | 1.11 | 1.60 | 2.99 | 1.51 |
| 512x512 | 0.63 | 3.87 | 4.77 | 1.10 | 1.90 | 3.29 | 1.45 |
| 544x544 | 0.68 | 4.52 | 5.48 | 1.10 | 1.83 | 3.20 | 1.71 |
| 576x576 | 0.68 | 5.05 | 6.00 | 1.30 | 2.29 | 3.90 | 1.54 |
| 608x608 | 0.73 | 5.16 | 6.16 | 1.25 | 2.18 | 3.73 | 1.65 |
| 640x640 | 0.76 | 5.34 | 6.42 | 1.70 | 2.68 | 4.67 | 1.37 |
| 672x672 | 0.79 | 5.92 | 6.98 | 1.56 | 2.27 | 4.10 | 1.70 |
| 704x704 | 0.83 | 6.72 | 7.84 | 1.83 | 2.49 | 4.60 | 1.70 |
| 736x736 | 0.82 | 7.00 | 8.09 | 1.67 | 2.79 | 4.81 | 1.68 |
| 768x768 | 0.99 | 7.27 | 8.54 | 2.05 | 2.94 | 5.30 | 1.61 |
| 800x800 | 0.98 | 8.22 | 9.49 | 2.16 | 3.01 | 5.46 | 1.74 |
| 832x832 | 1.13 | 8.31 | 9.72 | 2.13 | 3.43 | 5.88 | 1.65 |
| 864x864 | 1.09 | 9.58 | 10.95 | 2.57 | 3.25 | 6.11 | 1.79 |
| 896x896 | 1.24 | 9.83 | 11.35 | 2.30 | 3.62 | 6.22 | 1.82 |
| 928x928 | 1.21 | 10.95 | 12.44 | 2.46 | 3.86 | 6.63 | 1.88 |
| 960x960 | 1.23 | 10.88 | 12.39 | 2.54 | 4.52 | 7.36 | 1.68 |
| 992x992 | 1.45 | 12.61 | 14.34 | 2.81 | 4.09 | 7.36 | 1.95 |
| 1024x1024 | 1.41 | 12.89 | 14.60 | 2.90 | 4.04 | 7.24 | 2.02 |
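The per-request timings above are read from the `[QwenVLProcessor Perf]` log lines; a small parser for that line format (a hypothetical helper, not part of the PR, with field names taken from the sample log line) could look like:

```python
import re

# Matches every "<name>_time: X ms" field in one perf log line.
PERF_RE = re.compile(r"(\w+_time): ([\d.]+) ms")


def parse_perf_line(line: str) -> dict:
    """Extract all timing fields (in ms) from one '[QwenVLProcessor Perf]' line."""
    return {name: float(val) for name, val in PERF_RE.findall(line)}


sample = ("[2026-03-03 06:53:03] [QwenVLProcessor Perf] "
          "rid='44a05fff418f4b1cb448b345fa8ac336', "
          "load_time: 13.69 ms, preprocess_time: 0.00 ms, "
          "process_time: 307.90 ms, get_rope_index_time: 3.58 ms, "
          "total_time: 325.17 ms")
```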
  • A simple E2E test with Qwen3-VL-8B, using two 2048x2048 pictures.
============================================== Before this PR ============================================
Request 1 with picture 1:
[2026-03-20 07:12:56] [QwenVLProcessor Perf] rid='a622c8dc55994f01b6f6596cbc45bf6d', load_time: 11.30 ms, preprocess_time: 0.00 ms, process_time: 285.03 ms, get_rope_index_time: 3.70 ms, total_time: 300.03 ms
[2026-03-20 07:13:04] Prefill batch, #new-seq: 1, #new-token: 4128, #cached-token: 0, token usage: 0.09, #running-req: 0, #queue-req: 0, cuda graph: False, input throughput (token/s): 0.00

The image shows a computer mouse, photographed from a top-down angle.

Its main features are as follows:

- **Exterior design**: The mouse has a streamlined overall shape with a matte black surface and a subtle grainy texture, giving it a very modern, refined look.
- **Scroll wheel**: Located at the center of the mouse, it is a metal wheel with clear ring-shaped texturing, easy for a finger to grip and operate.
- **Structure**: The mouse has curved side skirts on both sides; the ergonomic design aims to provide a comfortable grip.
- **Background and lighting**: The background is pure black, highlighting the mouse's outline and details. Light falls from above, forming soft highlights on the surface and enhancing the sense of depth and texture.
- **Bottom**: A grid-like anti-slip texture is visible on the bottom of the mouse, helping it sit stably on a desk.

There is a "Doubao AI generated" watermark in the lower-right corner, indicating that the image may have been AI-generated.

In summary, this is a

Request 2 with picture 1 and 2:
[2026-03-20 07:13:13] [QwenVLProcessor Perf] rid='9360050de3284a6b8abf74974b09dd13', load_time: 8.10 ms, preprocess_time: 0.00 ms, process_time: 305.50 ms, get_rope_index_time: 0.65 ms, total_time: 314.25 ms
[2026-03-20 07:13:15] Prefill batch, #new-seq: 1, #new-token: 4128, #cached-token: 4096, token usage: 0.18, #running-req: 0, #queue-req: 0, cuda graph: False, input throughput (token/s): 356.51

Based on the two images you provided, their commonalities are mainly the following:

1.  **Same core subject**: Both images show the same model of computer mouse. Although the colors differ (one is dark gray/black, the other white), the outline, ergonomic design, scroll-wheel structure, and side-button layout are identical, so they can be judged to be different colorways of the same product.

2.  **Unified design style**: Both mice adopt a minimalist, modern design language: smooth lines, sleek surfaces, a rounded overall shape, and no superfluous decoration, reflecting a simple, tech-oriented style.

3.  **Consistent functional layout**: Both mice clearly have a central scroll wheel with a square button below it (possibly forward/back or a function key) and side buttons for thumb operation. This layout is the signature design of this mouse model.

4.  **AI-generated

============================================== After this PR ============================================
Request 1 with picture 1:
[2026-03-20 07:24:36] [QwenVLProcessor Perf] rid='9c6013f20584466aad6716abae3f2d41', load_time: 8.87 ms, preprocess_time: 0.00 ms, process_time: 261.03 ms, get_rope_index_time: 0.89 ms, total_time: 270.79 ms
[2026-03-20 07:24:43] Prefill batch, #new-seq: 1, #new-token: 4128, #cached-token: 0, token usage: 0.09, #running-req: 0, #queue-req: 0, cuda graph: False, input throughput (token/s): 0.00

The image shows a computer mouse, photographed from a top-down angle.

Its main features are as follows:

- **Exterior design**: The mouse has a streamlined overall shape with a matte black surface and a subtle grainy texture, giving it a very modern, refined look.
- **Scroll wheel**: Located at the center of the mouse, it is a metal wheel with clear ring-shaped texturing, easy for a finger to grip and operate.
- **Structure**: The mouse has curved side skirts on both sides; the ergonomic design aims to provide a comfortable grip.
- **Background and lighting**: The background is pure black, highlighting the mouse's outline and details. Light falls from above, forming soft highlights on the surface and enhancing the sense of depth and texture.
- **Bottom**: A grid-like anti-slip texture is visible on the bottom of the mouse, helping it sit stably on a desk.

There is a "Doubao AI generated" watermark in the lower-right corner, indicating that the image may have been AI-generated.

In summary, this is a

Request 2 with picture 1 and 2:
[2026-03-20 07:24:52] [QwenVLProcessor Perf] rid='77ae5d49db6c4ce492ad8c7861e1462b', load_time: 6.68 ms, preprocess_time: 0.00 ms, process_time: 265.42 ms, get_rope_index_time: 0.67 ms, total_time: 272.76 ms
[2026-03-20 07:24:54] Prefill batch, #new-seq: 1, #new-token: 4128, #cached-token: 4096, token usage: 0.18, #running-req: 0, #queue-req: 0, cuda graph: False, input throughput (token/s): 357.54

Based on the two images you provided, their commonalities are mainly the following:

1.  **Same core subject**: Both images show the same model of computer mouse. Although the colors differ (one is dark gray/black, the other white), the outline, ergonomic design, scroll-wheel structure, and side-button layout are identical, so they can be judged to be different colorways of the same product.

2.  **Unified design style**: Both mice adopt a minimalist, modern design language: smooth lines, sleek surfaces, a rounded overall shape, and no superfluous decoration, reflecting a simple, tech-oriented style.

3.  **Consistent functional layout**: Both mice clearly have a central scroll wheel with a square button below it (possibly forward/back or a function key) and side buttons for thumb operation. This layout is the signature design of this mouse model.

4.  **AI-generated

Request 3 with picture 1 again:
[2026-03-20 07:25:04] [QwenVLProcessor Perf] rid='19084840f467424893ce55672b9a006f', load_time: 2.85 ms, preprocess_time: 0.00 ms, process_time: 124.14 ms, get_rope_index_time: 0.43 ms, total_time: 127.42 ms
[2026-03-20 07:25:05] Prefill batch, #new-seq: 1, #new-token: 32, #cached-token: 4096, token usage: 0.09, #running-req: 0, #queue-req: 0, cuda graph: False, input throughput (token/s): 391.67

The image shows a computer mouse, photographed from a top-down angle.

Its main features are as follows:

- **Exterior design**: The mouse has a streamlined overall shape with a matte black surface and a subtle grainy texture, giving it a very modern, refined look.
- **Scroll wheel**: Located at the center of the mouse, it is a metal wheel with clear ring-shaped texturing, easy for a finger to grip and operate.
- **Structure**: The mouse has curved side skirts on both sides; the ergonomic design aims to provide a comfortable grip.
- **Background and lighting**: The background is pure black, highlighting the mouse's outline and details. Light falls from above, forming soft highlights on the surface and enhancing the sense of depth and texture.
- **Bottom**: A grid-like anti-slip texture is visible on the bottom of the mouse, helping it sit stably on a desk.

There is a "Doubao AI generated" watermark in the lower-right corner, indicating that the image may have been AI-generated.

In summary, this is a

Request 4 with picture 1 and 2 again:
[2026-03-20 07:25:14] [QwenVLProcessor Perf] rid='ce500eeccd4747da8fb3eeea191d76df', load_time: 4.26 ms, preprocess_time: 0.00 ms, process_time: 258.15 ms, get_rope_index_time: 0.67 ms, total_time: 263.08 ms
[2026-03-20 07:25:15] Prefill batch, #new-seq: 1, #new-token: 32, #cached-token: 8192, token usage: 0.18, #running-req: 0, #queue-req: 0, cuda graph: False, input throughput (token/s): 3.08

Based on the two images you provided, their commonalities are mainly the following:

1.  **Same core subject**: Both images show the same model of computer mouse. Although the colors differ (one is dark gray/black, the other white), the outline, ergonomic design, scroll-wheel structure, and side-button layout are identical, so they can be judged to be different colorways of the same product.

2.  **Unified design style**: Both mice adopt a minimalist, modern design language: smooth lines, sleek surfaces, a rounded overall shape, and no superfluous decoration, reflecting a simple, tech-oriented style.

3.  **Consistent functional layout**: Both mice clearly have a central scroll wheel with a square button below it (possibly forward/back or a function key) and side buttons for thumb operation. This layout is the signature design of this mouse model.

4.  **AI-generated

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.


@yuan-luo
Collaborator

yuan-luo commented Mar 4, 2026

/tag-and-rerun-ci

@wili-65535
Contributor Author

Hi maintainers, could you help me understand the CI failures? I'd like to address them to move this PR forward.

@yhyang201
Collaborator

/rerun-failed-ci

@yhyang201
Collaborator

Hi maintainers, could you help me understand the CI failures? I'd like to address them to move this PR forward.

CI might be flaky; please rerun until all checks pass.

@yhyang201
Collaborator

/rerun-failed-ci

@wili-65535
Contributor Author

wili-65535 commented Mar 10, 2026

Hi @yhyang201 @yuan-luo, I investigated the CI report and found some information.

registered/vlm/test_vision_openai_server_a.py

  • In our PR, when processing JPEG images on NVIDIA GPUs, function python/sglang/srt/utils/common.py::load_image() directly returns a torch GPU tensor instead of a PIL Image (here).
  • This works fine for most model workflows because the subsequent image processing is in transformers/src/transformers/image_processing_utils_fast.py::_process_image(), which accepts various input types including torch tensors, PIL Images, and numpy arrays (here).
  • However, MiniCPM models have their own image pre-processing script that only accepts PIL Image inputs and internally converts them to numpy arrays using the .numpy() method (see code).
  • This incompatibility causes the following error when running MiniCPM models on NVIDIA GPUs (raised here):
openai.InternalServerError: Error code: 500 - {'object': 'error', 'message': "Internal server error: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.", 'type': 'InternalServerError', 'param': None, 'code': 500}
  • In our testing, only MiniCPM-o-2_6 and MiniCPM-V-4 exhibit this issue. Should we implement any special handling in the unit tests for these specific models?

registered/lora/test_multi_lora_backend.py

The error below reproduces consistently.

KeyError: '/loky-7341-yz54xt5a'

xpu/test_intel_xpu_backend.py

The error below reproduces consistently.

AttributeError: module 'torch.xpu' has no attribute 'graph_pool_handle'

Some other tests:

The error is shown below, but I don't know what it means.

Error: Unhandled error: HttpError: <!DOCTYPE html>

@wili-65535 wili-65535 force-pushed the wili/jpeg-preprocess branch from 8ba3ec5 to d594058 Compare March 10, 2026 08:55
@wili-65535
Contributor Author

For the failing unit tests of MiniCPM-o-2_6 and MiniCPM-V-4, we have several solutions:

  1. Fix in MiniCPM's Huggingface code (these tests pass after fixing):
    • Change here from image = image.numpy() to image = image.cpu().numpy();
    • Change here from if isinstance(images, Image.Image): to if isinstance(images, (Image.Image, torch.Tensor)):
    • Change here from elif isinstance(images[0], Image.Image): to elif isinstance(images[0], (Image.Image, torch.Tensor)):
  2. Skip the tests on NVIDIA GPU?
  3. Add a switch to turn off the optimization in this PR when using those models?
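The first fix above boils down to making the conversion device-safe: `Tensor.numpy()` raises on CUDA tensors, while `Tensor.cpu().numpy()` works for both. A minimal sketch of that change, assuming torch is available (the `to_numpy` helper is illustrative, not MiniCPM's actual code):

```python
import torch


def to_numpy(image):
    """Device-safe conversion mirroring the suggested MiniCPM fix:
    .numpy() fails on CUDA tensors, so route through .cpu() first
    (.cpu() is a no-op for tensors already on the host)."""
    if isinstance(image, torch.Tensor):
        return image.cpu().numpy()
    # Non-tensor inputs (e.g. objects exposing .numpy()) keep the old path.
    return image.numpy()


cpu_img = torch.zeros(3, 4, 4)  # stand-in for a decoded CHW image tensor
arr = to_numpy(cpu_img)
```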

if discard_alpha_channel and img.mode != "RGB":
if (
discard_alpha_channel
and img.mode != "RGB"
Collaborator


This may also need a small adjustment.

Contributor Author


Fixed.

Collaborator

@yhyang201 yhyang201 Mar 12, 2026


It seems we should still check not isinstance(img, torch.Tensor) first?

Contributor Author

@wili-65535 wili-65535 Mar 12, 2026


Sorry, maybe I lost the commit... fixed now.
By the way, in encode_server.py we cannot easily figure out which model is in use (short of searching the name in self.server_args).
So gpu_image_decode is disabled by default there.

@yhyang201
Collaborator

For the failing unit tests of MiniCPM-o-2_6 and MiniCPM-V-4, we have several solutions:

  1. Fix in MiniCPM's Huggingface code (these tests pass after fixing):

    • Change here from image = image.numpy() to image = image.cpu().numpy();
    • Change here from if isinstance(images, Image.Image): to if isinstance(images, (Image.Image, torch.Tensor)):
    • Change here from elif isinstance(images[0], Image.Image): to elif isinstance(images[0], (Image.Image, torch.Tensor)):
  2. Skip the tests on NVIDIA GPU?

  3. Add a switch to turn off the optimization in this PR when using those models?

You might consider option 3.

Some processors may only accept PIL images, so one possible approach is to add a switch to disable GPU image decoding for those models.

For example (just a quick idea, not very well thought through, just for reference):

# base_processor.py
class BaseMultimodalProcessor(ABC):
    gpu_image_decode = True  # Enable GPU decoding by default
    ...

    @staticmethod
    def _load_single_item(data, modality, ..., gpu_image_decode=True):
        if modality == Modality.IMAGE:
            img, _ = load_image(data, use_gpu=gpu_image_decode)
            ...

Then incompatible models could simply turn it off:

# minicpm.py
class MiniCPMMultimodalProcessor(BaseMultimodalProcessor):
    gpu_image_decode = False  # MiniCPM HF processor does not support tensor inputs

Just a quick thought for reference.

Also, llava.py appears to call load_image() as well, so it might be worth checking whether the same adjustment is needed there.
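Stripped of the processor machinery, the opt-out pattern in the sketch above is just a class attribute that incompatible subclasses override. A runnable distillation (names mirror the sketch; this is illustrative, not the merged code):

```python
class BaseMultimodalProcessor:
    # GPU JPEG decoding on by default; incompatible processors flip it off.
    gpu_image_decode = True

    def load(self, data: bytes) -> str:
        # Stand-in for load_image(data, use_gpu=self.gpu_image_decode):
        # the GPU path yields a CUDA tensor, the fallback yields a PIL Image.
        return "gpu_tensor" if self.gpu_image_decode else "pil_image"


class MiniCPMMultimodalProcessor(BaseMultimodalProcessor):
    # MiniCPM's HF preprocessing only accepts PIL Images.
    gpu_image_decode = False
```

Because the flag is resolved through normal attribute lookup, each processor class opts out with a single line and no changes to the shared loading code.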

@wili-65535
Contributor Author

You might consider option 3.

Some processors may only accept PIL images, so one possible approach is to add a switch to disable GPU image decoding for those models.

...

Also, llava.py appears to call load_image() as well, so it might be worth checking whether the same adjustment is needed there.

Good idea, let's try this.
In addition, maybe AMD GPU can benefit from this switch in the future, too (https://lmsys.org/blog/2026-02-11-Qwen-latency/#221-image-decoding-optimization-with-rocjpeg).

@wili-65535 wili-65535 force-pushed the wili/jpeg-preprocess branch 2 times, most recently from 24bbdf0 to 798f0f1 Compare March 11, 2026 01:12
@wili-65535
Contributor Author

@yhyang201 I managed to add the switch; could you review it at your convenience?

@samuellees
Contributor

/rerun-failed-ci

@yhyang201
Collaborator

It seems like this change may affect InternVL2.5 and KimiVL.
In CI, KimiVL fails at w, h = image.size, which suggests the processor might be receiving a tensor/array-like object instead of a PIL image. InternVL2.5 tests also regress in the same run, possibly due to a similar image input type issue.

@wili-65535
Contributor Author

It seems like this change may affect InternVL2.5 and KimiVL. In CI, KimiVL fails at w, h = image.size, which suggests the processor might be receiving a tensor/array-like object instead of a PIL image. InternVL2.5 tests also regress in the same run, possibly due to a similar image input type issue.

The code for these two models has been fixed.

@samuellees
Contributor

samuellees commented Mar 14, 2026

/rerun-failed-ci again

@wili-65535
Contributor Author

The result of the GPQA tests has been updated in the description. Could we move the PR forward?

@yhyang201
Collaborator

Let me see what exactly is wrong with CI.

@yhyang201
Collaborator

I’ll rebase and see if the CI passes.

max_dynamic_patch: Optional[int] = None


image_extension_names = (".png", ".jpg", ".jpeg", ".webp", ".gif")
Collaborator


we need a mm_utils.py in this folder after this PR
cc @yhyang201

Collaborator


got it

@mickqian
Collaborator

Great work. Do we have e2e comparison results, BTW?

@wili-65535
Contributor Author

wili-65535 commented Mar 20, 2026

Great work. Do we have e2e comparison results, BTW?

Thank you for your attention! @mickqian
We only ran the mmmu_val task in lmms_eval (shown in the description).
What other end-to-end tests do you suggest we add?

Furthermore, I have added the result of a simple E2E test with Qwen3-VL-8B to the description.

@mickqian
Collaborator

@wili-65535 I'm thinking we might need performance statistics on e2e benchmarks for this PR; you could check bench_serving.py or the mmmu folder.

v0.2: fix CI error

v2.0: add gpu_image_decode

v2.1: fix in encode_server.py

v2.2: fix more models
@wili-65535 wili-65535 force-pushed the wili/jpeg-preprocess branch from 1867b27 to af32299 Compare March 25, 2026 03:23
@yhyang201
Collaborator

yhyang201 commented Mar 27, 2026

Used a tool to conduct a latency test on Qwen3-VL-8B-Instruct (tp=1) with a single request, progressively increasing the number of images.

Each request contains N images of the same resolution (with N increasing from 1 to 32), a text input length of 256 tokens, and an output length of 32 tokens. The timeout for each individual request is set to 300 seconds.

Tests are conducted independently at three resolutions: 720p, 1080p, and 1440×2560. The server is restarted whenever switching resolutions. This setup is used to observe how the response time of a single request changes as the number of images increases.

For full experimental details, please refer to:
https://github.com/yhyang201/sgl-bench/tree/main/records/20260327/20260327_042330_qwen3_vl_8b_max_image_count_probe_1-32
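One way to sketch the request construction for such a probe (the endpoint shape and field names assume the OpenAI-compatible chat API; this is not the benchmark script itself):

```python
import base64
import json


def build_probe_request(jpeg_bytes: bytes, n_images: int, prompt: str) -> dict:
    """Build one chat-completions payload carrying N copies of the same image."""
    data_url = "data:image/jpeg;base64," + base64.b64encode(jpeg_bytes).decode()
    content = [{"type": "image_url", "image_url": {"url": data_url}}
               for _ in range(n_images)]
    content.append({"type": "text", "text": prompt})
    return {
        "model": "Qwen3-VL-8B-Instruct",
        "messages": [{"role": "user", "content": content}],
        "max_tokens": 32,   # matches the 32-token output length above
        "stream": True,     # streaming lets the client measure TTFT
    }


req = build_probe_request(b"\xff\xd8\xff\xe0", 4, "Describe the images.")
```

A probe loop would then POST this payload for N = 1..32, timing the first streamed chunk (TTFT) and the full response (e2e), and restart the server between resolutions as described above.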

main:

============================================================
Probing: 720p (1280x720)
============================================================
  Warmup: sending 3 requests (1~3 images)...
    warmup [1 images] ok
    warmup [2 images] ok
    warmup [3 images] ok
  Warmup done.

  [1 images] Generating 1x 720p... (0.0s) Sending... OK (TTFT=152ms, e2e=0.3s)
  [2 images] Generating 2x 720p... (0.0s) Sending... OK (TTFT=288ms, e2e=0.5s)
  [3 images] Generating 3x 720p... (0.1s) Sending... OK (TTFT=423ms, e2e=0.6s)
  [4 images] Generating 4x 720p... (0.1s) Sending... OK (TTFT=647ms, e2e=0.8s)
  [5 images] Generating 5x 720p... (0.1s) Sending... OK (TTFT=710ms, e2e=0.9s)
  [6 images] Generating 6x 720p... (0.1s) Sending... OK (TTFT=842ms, e2e=1.0s)
  [7 images] Generating 7x 720p... (0.1s) Sending... OK (TTFT=1137ms, e2e=1.3s)
  [8 images] Generating 8x 720p... (0.1s) Sending... OK (TTFT=1104ms, e2e=1.3s)
  [9 images] Generating 9x 720p... (0.2s) Sending... OK (TTFT=1263ms, e2e=1.5s)
  [10 images] Generating 10x 720p... (0.2s) Sending... OK (TTFT=1509ms, e2e=1.7s)
  [11 images] Generating 11x 720p... (0.2s) Sending... OK (TTFT=1661ms, e2e=1.9s)
  [12 images] Generating 12x 720p... (0.2s) Sending... OK (TTFT=1797ms, e2e=2.0s)
  [13 images] Generating 13x 720p... (0.2s) Sending... OK (TTFT=2308ms, e2e=2.5s)
  [14 images] Generating 14x 720p... (0.3s) Sending... OK (TTFT=2147ms, e2e=2.4s)
  [15 images] Generating 15x 720p... (0.3s) Sending... OK (TTFT=2296ms, e2e=2.5s)
  [16 images] Generating 16x 720p... (0.3s) Sending... OK (TTFT=2443ms, e2e=2.7s)
  [17 images] Generating 17x 720p... (0.3s) Sending... OK (TTFT=2622ms, e2e=2.8s)
  [18 images] Generating 18x 720p... (0.3s) Sending... OK (TTFT=2796ms, e2e=3.0s)
  [19 images] Generating 19x 720p... (0.3s) Sending... OK (TTFT=3182ms, e2e=3.4s)
  [20 images] Generating 20x 720p... (0.4s) Sending... OK (TTFT=3411ms, e2e=3.6s)
  [21 images] Generating 21x 720p... (0.4s) Sending... OK (TTFT=3523ms, e2e=3.8s)


============================================================
Probing: 1080p (1920x1080)
============================================================
  Warmup: sending 3 requests (1~3 images)...
    warmup [1 images] ok
    warmup [2 images] ok
    warmup [3 images] ok
  Warmup done.

  [1 images] Generating 1x 1080p... (0.0s) Sending... OK (TTFT=334ms, e2e=0.5s)
  [2 images] Generating 2x 1080p... (0.1s) Sending... OK (TTFT=656ms, e2e=0.8s)
  [3 images] Generating 3x 1080p... (0.1s) Sending... OK (TTFT=984ms, e2e=1.2s)
  [4 images] Generating 4x 1080p... (0.2s) Sending... OK (TTFT=1328ms, e2e=1.5s)
  [5 images] Generating 5x 1080p... (0.2s) Sending... OK (TTFT=1861ms, e2e=2.1s)
  [6 images] Generating 6x 1080p... (0.3s) Sending... OK (TTFT=2578ms, e2e=2.8s)
  [7 images] Generating 7x 1080p... (0.3s) Sending... OK (TTFT=2635ms, e2e=2.8s)
  [8 images] Generating 8x 1080p... (0.3s) Sending... OK (TTFT=3033ms, e2e=3.2s)
  [9 images] Generating 9x 1080p... (0.4s) Sending... OK (TTFT=3778ms, e2e=4.0s)
  [10 images] Generating 10x 1080p... (0.5s) Sending... OK (TTFT=4195ms, e2e=4.4s)
  [11 images] Generating 11x 1080p... (0.5s) Sending... OK (TTFT=5252ms, e2e=5.5s)
  [12 images] Generating 12x 1080p... (0.6s) Sending... OK (TTFT=5090ms, e2e=5.3s)
  [13 images] Generating 13x 1080p... (0.6s) Sending... OK (TTFT=6059ms, e2e=6.3s)
  [14 images] Generating 14x 1080p... (0.6s) Sending... OK (TTFT=6529ms, e2e=6.8s)
  [15 images] Generating 15x 1080p... (0.7s) Sending... OK (TTFT=7034ms, e2e=7.3s)
  [16 images] Generating 16x 1080p... (0.7s) Sending... OK (TTFT=7649ms, e2e=7.9s)
  [17 images] Generating 17x 1080p... (0.7s) Sending... OK (TTFT=8805ms, e2e=9.1s)
  [18 images] Generating 18x 1080p... (0.8s) Sending... OK (TTFT=9370ms, e2e=9.7s)
  [19 images] Generating 19x 1080p... (0.8s) Sending... OK (TTFT=9868ms, e2e=10.2s)
  [20 images] Generating 20x 1080p... (0.9s) Sending... OK (TTFT=10508ms, e2e=10.8s)
  [21 images] Generating 21x 1080p... (1.0s) Sending... OK (TTFT=11798ms, e2e=12.1s)
  [22 images] Generating 22x 1080p... (1.0s) Sending... OK (TTFT=13687ms, e2e=14.0s)
  [23 images] Generating 23x 1080p... (1.1s) Sending... OK (TTFT=13017ms, e2e=13.3s)
  [24 images] Generating 24x 1080p... (1.1s) Sending... OK (TTFT=13764ms, e2e=14.1s)
  [25 images] Generating 25x 1080p... (1.2s) Sending... OK (TTFT=15322ms, e2e=15.7s)
  [26 images] Generating 26x 1080p... (1.2s) Sending... OK (TTFT=15923ms, e2e=16.3s)
  [27 images] Generating 27x 1080p... (1.3s) Sending... OK (TTFT=16645ms, e2e=17.0s)
  [28 images] Generating 28x 1080p... (1.3s) Sending... OK (TTFT=17308ms, e2e=17.7s)
  [29 images] Generating 29x 1080p... (1.3s) Sending... OK (TTFT=19003ms, e2e=19.4s)


============================================================
Probing: 1440x2560 (2560x1440)
============================================================
  Warmup: sending 3 requests (1~3 images)...
    warmup [1 images] ok
    warmup [2 images] ok
    warmup [3 images] ok
  Warmup done.

  [1 images] Generating 1x 1440x2560... (0.1s) Sending... OK (TTFT=608ms, e2e=0.8s)
  [2 images] Generating 2x 1440x2560... (0.2s) Sending... OK (TTFT=1206ms, e2e=1.4s)
  [3 images] Generating 3x 1440x2560... (0.2s) Sending... OK (TTFT=2125ms, e2e=2.3s)
  [4 images] Generating 4x 1440x2560... (0.3s) Sending... OK (TTFT=3148ms, e2e=3.4s)
  [5 images] Generating 5x 1440x2560... (0.4s) Sending... OK (TTFT=4076ms, e2e=4.3s)
  [6 images] Generating 6x 1440x2560... (0.5s) Sending... OK (TTFT=4921ms, e2e=5.2s)
  [7 images] Generating 7x 1440x2560... (0.6s) Sending... OK (TTFT=7037ms, e2e=7.3s)
  [8 images] Generating 8x 1440x2560... (0.7s) Sending... OK (TTFT=7432ms, e2e=7.7s)
  [9 images] Generating 9x 1440x2560... (0.7s) Sending... OK (TTFT=8358ms, e2e=8.6s)
  [10 images] Generating 10x 1440x2560... (0.8s) Sending... OK (TTFT=10356ms, e2e=10.7s)
  [11 images] Generating 11x 1440x2560... (0.9s) Sending... OK (TTFT=11502ms, e2e=11.8s)
  [12 images] Generating 12x 1440x2560... (1.0s) Sending... OK (TTFT=13678ms, e2e=14.0s)
  [13 images] Generating 13x 1440x2560... (1.1s) Sending... OK (TTFT=16127ms, e2e=16.4s)
  [14 images] Generating 14x 1440x2560... (1.2s) Sending... OK (TTFT=17453ms, e2e=17.8s)

@yhyang201
Collaborator

This pr:

============================================================
Probing: 720p (1280x720)
============================================================
  Warmup: sending 3 requests (1~3 images)...
    warmup [1 images] ok
    warmup [2 images] ok
    warmup [3 images] ok
  Warmup done.

  [1 images] Generating 1x 720p... (0.0s) Sending... OK (TTFT=145ms, e2e=0.3s)
  [2 images] Generating 2x 720p... (0.0s) Sending... OK (TTFT=274ms, e2e=0.4s)
  [3 images] Generating 3x 720p... (0.1s) Sending... OK (TTFT=394ms, e2e=0.6s)
  [4 images] Generating 4x 720p... (0.1s) Sending... OK (TTFT=620ms, e2e=0.8s)
  [5 images] Generating 5x 720p... (0.1s) Sending... OK (TTFT=662ms, e2e=0.8s)
  [6 images] Generating 6x 720p... (0.1s) Sending... OK (TTFT=786ms, e2e=1.0s)
  [7 images] Generating 7x 720p... (0.1s) Sending... OK (TTFT=1088ms, e2e=1.3s)
  [8 images] Generating 8x 720p... (0.1s) Sending... OK (TTFT=1044ms, e2e=1.2s)
  [9 images] Generating 9x 720p... (0.2s) Sending... OK (TTFT=1196ms, e2e=1.4s)
  [10 images] Generating 10x 720p... (0.2s) Sending... OK (TTFT=1444ms, e2e=1.6s)
  [11 images] Generating 11x 720p... (0.2s) Sending... OK (TTFT=1589ms, e2e=1.8s)
  [12 images] Generating 12x 720p... (0.2s) Sending... OK (TTFT=1717ms, e2e=1.9s)
  [13 images] Generating 13x 720p... (0.2s) Sending... OK (TTFT=2198ms, e2e=2.4s)
  [14 images] Generating 14x 720p... (0.3s) Sending... OK (TTFT=2017ms, e2e=2.2s)
  [15 images] Generating 15x 720p... (0.3s) Sending... OK (TTFT=2165ms, e2e=2.4s)
  [16 images] Generating 16x 720p... (0.3s) Sending... OK (TTFT=2327ms, e2e=2.5s)
  [17 images] Generating 17x 720p... (0.3s) Sending... OK (TTFT=2491ms, e2e=2.7s)
  [18 images] Generating 18x 720p... (0.3s) Sending... OK (TTFT=2619ms, e2e=2.8s)
  [19 images] Generating 19x 720p... (0.3s) Sending... OK (TTFT=3037ms, e2e=3.3s)
  [20 images] Generating 20x 720p... (0.4s) Sending... OK (TTFT=3217ms, e2e=3.4s)
  [21 images] Generating 21x 720p... (0.4s) Sending... OK (TTFT=3388ms, e2e=3.6s)
  [22 images] Generating 22x 720p... (0.4s) Sending... OK (TTFT=3558ms, e2e=3.8s)
  [23 images] Generating 23x 720p... (0.4s) Sending... OK (TTFT=3721ms, e2e=4.0s)
  [24 images] Generating 24x 720p... (0.4s) Sending... OK (TTFT=3888ms, e2e=4.1s)
  [25 images] Generating 25x 720p... (0.5s) Sending... OK (TTFT=4681ms, e2e=4.9s)
  [26 images] Generating 26x 720p... (0.5s) Sending... OK (TTFT=4338ms, e2e=4.6s)
  [27 images] Generating 27x 720p... (0.5s) Sending... OK (TTFT=4400ms, e2e=4.6s)
  [28 images] Generating 28x 720p... (0.5s) Sending... OK (TTFT=4945ms, e2e=5.2s)
  [29 images] Generating 29x 720p... (0.5s) Sending... OK (TTFT=5115ms, e2e=5.4s)
  [30 images] Generating 30x 720p... (0.6s) Sending... OK (TTFT=5310ms, e2e=5.6s)
  [31 images] Generating 31x 720p... (0.6s) Sending... OK (TTFT=5499ms, e2e=5.8s)
  [32 images] Generating 32x 720p... (0.6s) Sending... OK (TTFT=5696ms, e2e=6.0s)


============================================================
Probing: 1080p (1920x1080)
============================================================
  Warmup: sending 3 requests (1~3 images)...
    warmup [1 images] ok
    warmup [2 images] ok
    warmup [3 images] ok
  Warmup done.

  [1 images] Generating 1x 1080p... (0.0s) Sending... OK (TTFT=319ms, e2e=0.5s)
  [2 images] Generating 2x 1080p... (0.1s) Sending... OK (TTFT=620ms, e2e=0.8s)
  [3 images] Generating 3x 1080p... (0.1s) Sending... OK (TTFT=928ms, e2e=1.1s)
  [4 images] Generating 4x 1080p... (0.2s) Sending... OK (TTFT=1270ms, e2e=1.5s)
  [5 images] Generating 5x 1080p... (0.2s) Sending... OK (TTFT=1764ms, e2e=2.0s)
  [6 images] Generating 6x 1080p... (0.3s) Sending... OK (TTFT=2476ms, e2e=2.7s)
  [7 images] Generating 7x 1080p... (0.3s) Sending... OK (TTFT=2501ms, e2e=2.7s)
  [8 images] Generating 8x 1080p... (0.4s) Sending... OK (TTFT=2899ms, e2e=3.1s)
  [9 images] Generating 9x 1080p... (0.4s) Sending... OK (TTFT=3595ms, e2e=3.8s)
  [10 images] Generating 10x 1080p... (0.5s) Sending... OK (TTFT=4053ms, e2e=4.3s)
  [11 images] Generating 11x 1080p... (0.5s) Sending... OK (TTFT=5092ms, e2e=5.3s)
  [12 images] Generating 12x 1080p... (0.6s) Sending... OK (TTFT=4930ms, e2e=5.2s)
  [13 images] Generating 13x 1080p... (0.6s) Sending... OK (TTFT=5876ms, e2e=6.1s)
  [14 images] Generating 14x 1080p... (0.6s) Sending... OK (TTFT=6330ms, e2e=6.6s)
  [15 images] Generating 15x 1080p... (0.7s) Sending... OK (TTFT=6809ms, e2e=7.1s)
  [16 images] Generating 16x 1080p... (0.7s) Sending... OK (TTFT=7419ms, e2e=7.7s)
  [17 images] Generating 17x 1080p... (0.8s) Sending... OK (TTFT=8593ms, e2e=8.9s)
  [18 images] Generating 18x 1080p... (0.8s) Sending... OK (TTFT=9080ms, e2e=9.4s)
  [19 images] Generating 19x 1080p... (0.8s) Sending... OK (TTFT=9621ms, e2e=9.9s)
  [20 images] Generating 20x 1080p... (0.9s) Sending... OK (TTFT=10225ms, e2e=10.5s)
  [21 images] Generating 21x 1080p... (1.0s) Sending... OK (TTFT=11544ms, e2e=11.9s)
  [22 images] Generating 22x 1080p... (1.0s) Sending... OK (TTFT=13289ms, e2e=13.6s)
  [23 images] Generating 23x 1080p... (1.0s) Sending... OK (TTFT=12652ms, e2e=13.0s)
  [24 images] Generating 24x 1080p... (1.0s) Sending... OK (TTFT=13362ms, e2e=13.7s)
  [25 images] Generating 25x 1080p... (1.1s) Sending... OK (TTFT=14938ms, e2e=15.3s)
  [26 images] Generating 26x 1080p... (1.2s) Sending... OK (TTFT=15568ms, e2e=15.9s)
  [27 images] Generating 27x 1080p... (1.2s) Sending... OK (TTFT=16179ms, e2e=16.5s)
  [28 images] Generating 28x 1080p... (1.3s) Sending... OK (TTFT=16904ms, e2e=17.3s)
  [29 images] Generating 29x 1080p... (1.3s) Sending... OK (TTFT=18674ms, e2e=19.0s)
  [30 images] Generating 30x 1080p... (1.3s) Sending... OK (TTFT=19359ms, e2e=19.7s)


============================================================
Probing: 1440x2560 (2560x1440)
============================================================
  Warmup: sending 3 requests (1~3 images)...
    warmup [1 images] ok
    warmup [2 images] ok
    warmup [3 images] ok
  Warmup done.

  [1 images] Generating 1x 1440x2560... (0.1s) Sending... OK (TTFT=579ms, e2e=0.8s)
  [2 images] Generating 2x 1440x2560... (0.2s) Sending... OK (TTFT=1151ms, e2e=1.3s)
  [3 images] Generating 3x 1440x2560... (0.3s) Sending... OK (TTFT=2029ms, e2e=2.2s)
  [4 images] Generating 4x 1440x2560... (0.3s) Sending... OK (TTFT=3032ms, e2e=3.2s)
  [5 images] Generating 5x 1440x2560... (0.4s) Sending... OK (TTFT=3917ms, e2e=4.1s)
  [6 images] Generating 6x 1440x2560... (0.5s) Sending... OK (TTFT=4746ms, e2e=5.0s)
  [7 images] Generating 7x 1440x2560... (0.6s) Sending... OK (TTFT=6823ms, e2e=7.1s)
  [8 images] Generating 8x 1440x2560... (0.7s) Sending... OK (TTFT=7181ms, e2e=7.4s)
  [9 images] Generating 9x 1440x2560... (0.7s) Sending... OK (TTFT=8164ms, e2e=8.4s)
  [10 images] Generating 10x 1440x2560... (0.8s) Sending... OK (TTFT=10054ms, e2e=10.3s)
  [11 images] Generating 11x 1440x2560... (0.9s) Sending... OK (TTFT=11166ms, e2e=11.5s)
  [12 images] Generating 12x 1440x2560... (1.0s) Sending... OK (TTFT=13351ms, e2e=13.7s)
  [13 images] Generating 13x 1440x2560... (1.1s) Sending... OK (TTFT=15695ms, e2e=16.0s)
  [14 images] Generating 14x 1440x2560... (1.2s) Sending... OK (TTFT=17002ms, e2e=17.3s)

  Result: 1440x2560 max = 14 images

@yhyang201
Collaborator

This PR reduces TTFT by about 3–5% overall, with the most noticeable improvement (~5%) at 720p and smaller gains at higher resolutions.

@yhyang201
Collaborator

All CI checks have passed — should we go ahead and merge?

@yhyang201 yhyang201 merged commit 5bb9ca0 into sgl-project:main Mar 29, 2026
578 of 664 checks passed
@wili-65535 wili-65535 deleted the wili/jpeg-preprocess branch March 30, 2026 02:21
satyamk7054 pushed a commit to satyamk7054/sglang that referenced this pull request Apr 3, 2026
realray808 pushed a commit to Ascend/sglang that referenced this pull request Apr 3, 2026
* [AMD] Fix AMD CI monitor GitHub API rate limit exhaustion (sgl-project#21527)

* [CI] Register missing jit_kernel test files (sgl-project#21547)

* [diffusion] fix: return None instead of raising RuntimeError when no model info found (sgl-project#21319)

Co-authored-by: Mick <mickjagger19@icloud.com>

* [rl][sgl] fix tensor mismatch after pause (sgl-project#21514)

* [Hicache & JIT_kernel] Support page first layout  & mla jit kernel (sgl-project#18311)

* test: point DSV3 int8 MLA CI models to lmsys Hugging Face org (sgl-project#21561)

* [CI] Relax several thresholds in flaky CIs (sgl-project#21562)

* feat: add gc_threshold arg (sgl-project#21481)

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* Fix flaky test_pp_single_node (sgl-project#21564)

* Split workflow for releasing runtime docker (sgl-project#21563)

* fix tp capture in vit cuda graph (sgl-project#17255)

* [1/n] lora support - Auto detect lora target modules (sgl-project#21439)

Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>

* [fix] qwen3.5 fuse_moe_triton_tune bug (sgl-project#20232)

* Remove sync when enabling return_logprob (sgl-project#20972)

* Scope streaming backlog coalescing to incremental_streaming_output mode (sgl-project#21037)

Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com>
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>

* docs: flesh out MAINTAINER.md oncall lists and link GitHub profiles (sgl-project#21575)

* [NVIDIA] Enable automatic NUMA configuration (sgl-project#19452)

* [diffusion] UX: aggregate expected dtype-cast logs during weight loading (sgl-project#21552)

* [diffusion] refactor: Unify `TeaCacheParams` and `WanTeaCacheParams` (sgl-project#20706)

Co-authored-by: Mick <mickjagger19@icloud.com>

* [diffusion] chore: remove redundant identity preprocess_text functions(sgl-project#20633)

Co-authored-by: Fengyuan Yu <15fengyuan@gmail.com>

* Update CODEOWNERS for transformers.py and docs (sgl-project#21555)

Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>

* reduce CPU peak memory in multimodal tensor hashing (sgl-project#21123)

* Fix HFRunner hang when subprocess dies during init (sgl-project#21582)

* Fix Piecewise CUDA Graph crash with `-enable-mixed-chunk` (sgl-project#20441)

Co-authored-by: jianyingzhu <joeyzhu@nvidia.com>

* [CI] Replace upload/download-artifact with job outputs in release-docker workflow (sgl-project#21579)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Patch transformers is_base_mistral in CI to avoid HF 429 rate limiting (sgl-project#21586)

* [CI] Move v32 cp test to deepep running suite (sgl-project#21585)

* [AMD] Add GLM-4.7-FP8 accuracy CI test for MI35x (sgl-project#21534)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* [Clean] Remove deprecated environs (sgl-project#21536)

* [diffusion] fix: fix Flux2-Klein prompt tokenization length to 512 and add regression coverage (sgl-project#21407)

* [CI] hot-fix ci lint (sgl-project#21608)

* [diffusion] feat: support overlay model materialization (sgl-project#21600)

* [VLM] Optimize ShmPointerMMData for multi-pickle safety and deferred unwrap (sgl-project#21465)

* feat: enable CUDA graph and timestamp for the whisper model(sgl-project#21190)

* [NPU] Update quantization&CI documentation (sgl-project#21100)

Co-authored-by: Tamir Baydasov <41994229+TamirBaydasov@users.noreply.github.com>

* Skip ci for .md files (sgl-project#21482)

* Support skip-softmax attention (sgl-project#19089)

* fix: piecewise_cuda_graph get correct qo_indptr (sgl-project#21452)

Co-authored-by: Avery Huang <averyh@nvidia.com>

* fix bench_serving sglang backend to support image dataset  (sgl-project#21294)

* [AMD] Add peft>=0.18.0 to diffusion_hip deps for transformers 5.x compat for AMD diffusion model (sgl-project#21442)

Co-authored-by: HaiShaw <hixiao@gmail.com>

* [GDN] Fuse GDN kkt + solve_tril into one kernel (sgl-project#21411)

Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>

* [Diffusion] Align diffusion benchmark skill presets with nightly comparison cases (sgl-project#21616)

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* Clean up detokenizer and remove dead multimodal_gen code (sgl-project#21588)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [CI] Skip flaky elastic EP test (sgl-project#21619)

* feat(ci): add GB300 nightly benchmark test suites (sgl-project#21487)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [CI] Lossen test_return_routed_experts threshold (sgl-project#21270)

* Add subprocess liveness monitor to detect scheduler crashes (sgl-project#18582)

Co-authored-by: 继优 <jiyou.ljy@alibaba-inc.com>
Co-authored-by: shuwenn <47200617+alphabetc1@users.noreply.github.com>

* fix: scheduler launch hang when non-current rank dies (sgl-project#20287)

* Wrap IPv6 addresses in gRPC, bench_serving, and log messages (sgl-project#21236)

Co-authored-by: hnyls2002 <lsyincs@gmail.com>
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>

* [HiCache] fix: graceful shutdown of pending async tasks in bench_mix.py (sgl-project#20276)

* Clean up _wait_for_scheduler_ready implementation (sgl-project#21626)

* fix cuda graph capturing error in sm120 mxfp8 triton path (sgl-project#19835)

* [sgl] disable piecewise cuda graph when a model doesn't have layers (sgl-project#21565)

* [Feature] Optimizations for JPEG input on NVIDIA GPU (sgl-project#19749)

* [VLM] perf: optimize CUDA IPC for multimodal transfer by caching IPC pool handles (sgl-project#21418)

* [Fix] SGLANG_USE_CUDA_IPC_TRANSPORT=1 and SGLANG_ENABLE_MM_SPLITTING=1 do not work at the same time. (sgl-project#19915)

* [Fix] Remove redundant allreduce fusion block and skip TP=1 (sgl-project#20621)

* Simplify routed experts test and move base64 encoding to tokenizer manager (sgl-project#21634)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [Cleanup] Remove unused BatchMultimodalOutput and BatchMultimodalDecodeReq (sgl-project#21640)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Clean up TokenizerManager: remove dead code and improve rid validation (sgl-project#21639)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* README: coding agent sponsorship for long-term contributors (sgl-project#21642)

* Fix circular reference in CustomTestCase.__init_subclass__ (sgl-project#21650)

Co-authored-by: wan4ch <wan4ch@gmail.com>

* [Fix] Fix Qwen3.5 MoE model loading and Mamba cache sharding in PP mode (sgl-project#21448)

Co-authored-by: zhangxiaolei123456 <zhangxiaolei.666@bytedance.com>

* [diffusion] CI: fix dashboard chart (nightly) display issues (sgl-project#21653)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* Update sponsorship details in README.md (sgl-project#21658)

* [Fix] Handle pre-release tags in nightly wheel version parsing (sgl-project#21656)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [Intel GPU] Enable DeepSeek R1 inference on XPU (sgl-project#18461)

Signed-off-by: P V R K Jyothendra Varma <polisetty.v.r.k.jyothendra.varma@intel.com>

* [Doc] Update tips for developer new-comers (sgl-project#21659)

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* [CI] [FlashInfer v0.6.7] Use offline quantized checkpoint for MXFP8 Gemm tests (sgl-project#21625)

* MFU metrics in Prometheus  (sgl-project#19395)

* fix topk softmax performance issue (sgl-project#14702)

* [CPU] add kernel apply_rotary_pos_emb_cpu for Qwen3-VL and Qwen3-Omni (sgl-project#13121)

Co-authored-by: Ma Mingfei <mingfei.ma@intel.com>

* [CPU] Implement MXFP4 Gemm kernels for intel AMX to support GPT OSS series. (sgl-project#14385)

* [AMD] Fused rope kv store (sgl-project#21315)

Co-authored-by: wunhuang <wunhuang@amd.com>

* [NPU] Update DeepSeek-V3.2 model deployment instructions in documentation (sgl-project#21468)

Co-authored-by: wuxue (C) <w00964934@china.huawei.com>

* [AMD] Support AMD MXFP4 Qwen3.5-397B-A17B model (sgl-project#21234)

* [Fix] Fix weight_loader property assignment for qwen3-next FP8 models (sgl-project#21662)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix mamba cache leak when adder fails to add a matched req. (sgl-project#21404)

* fix: Mistral Small 4 fails to start due to config/weight format mismatch (sgl-project#21620)

Co-authored-by: mengxiancheng03 <mengxiancheng03@kuaishou.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [diffusion] feat: enhance overlay mechanism (sgl-project#21648)

* [diffusion] CI: relax pr-test threshold (sgl-project#21682)

* [NPU][Diffusion] fix sp modulate for qwen-image-edit (sgl-project#20974)

Co-authored-by: 高鑫 <gaoxin@gaoxindeMacBook-Pro.local>

* [NPU] fix eagle3 accept rate (sgl-project#21255)

* DeepSeek-R1-0528-w4a8: DeepEP Low Latency Dispatch Adopts FP8 Communication (sgl-project#14162)

Co-authored-by: undefined <zhouchen.arrebol@jd.com>

* [NPU] GLM-5 optimize with fused kernels (sgl-project#18617)

* [NPU][diffusion]: support parallel decoding of qwen-image (sgl-project#20757)

Co-authored-by: 高鑫 <gaoxin@gaoxindeMacBook-Pro.local>

* [diffusion] [NPU] support ring attention on NPU with FA (sgl-project#21383)

* [diffusion][doc]: add ring sp performance benchmark page (sgl-project#20998)

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* [GLM-V and GLM-4.7] Cast to FP32 before gate projection for GLM model. (sgl-project#21660)

* fix nemotron capture for non attention layers (sgl-project#21436)

* [Bugfix][NPU] Skip FRACTAL_NZ format for MoE weights with unaligned dimensions (sgl-project#21209)

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: ronnie_zheng <zl19940307@163.com>

* [AMD] Add SGLANG_DISAGGREGATION_NUM_PRE_ALLOCATE_REQS env var for configurable KV transfer overlap (sgl-project#20410)

Co-authored-by: HaiShaw <hixiao@gmail.com>

* [AMD][MoRI] bump MoRI to v0.1.0 (sgl-project#21673)

* [AMD] fix performance regression issue when run gpt-oss with "--context-length 13824" (sgl-project#21691)

* Remove flashinfer wheel cache cleanup that deletes other versions (sgl-project#21711)

Co-authored-by: Alison Shao <alison.shao@MacBook-Pro-D2W773R9CD.local>

* [misc] multiprocess compilation to speed up test (sgl-project#21483)

* Fix human-eval CI install on 5090 runners (sgl-project#21714)

Co-authored-by: Alison Shao <alison.shao@Mac.attlocal.net>

* Revert "DeepSeek-R1-0528-w4a8: DeepEP Low Latency Dispatch Adopts FP8 Communication" (sgl-project#21719)

* [Fix] Update supported custom_mem_pool types for mooncake (sgl-project#21728)

Co-authored-by: 百麒 <yaozhong.lyz@alibaba-inc.com>

* [Perf]Remove H2D  for Qwen3.5 SpecV2 (sgl-project#20864)

* [AMD] Fix CI multimodal-gen-test-1-gpu-amd for gen model  (sgl-project#21621)

* [diffusion] fix: fix Flux.2 with tp(sgl-project#21664)

* Add explicit disable flag for FlashInfer allreduce fusion (sgl-project#21446)

* [NPU] fix conflict between empty_cache and use_mem_pool (sgl-project#21507)

* [AMD] Use tgemm.mm for MoEGate router gemm in deepseek_v2.py (sgl-project#21657)

* [CI]Remove msgm-en and mmlu tests which cause timeout (sgl-project#21733)

* Fix disaggregation hybrid attention ci (sgl-project#21745)

* Rename rerun-ut to rerun-test (sgl-project#21747)

* bugfix(model):fix deepstack index out of range error (sgl-project#21727)

Co-authored-by: xiaoqi.31 <xiaoqi.31@jd.com>

* [diffusion] fix: fix typo (sgl-project#21746)

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* [CI] Fix rerun-test suite detection to skip commented registrations (sgl-project#21753)

* [PD] Refactor Disagg Conn and Fix Hang with total_request/total_tokens Balancing (sgl-project#21299)

Co-authored-by: Weiliangl User <weiliangl@login-node.hosted.internal>

* [CI] Fix ring test timeout (sgl-project#21751)

* Enable evict swa with piecewise cuda graph (sgl-project#21754)

* Fix kimi-linear launch server error (sgl-project#21752)

Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>

* [PD] Tiny cleanup after KVReceiver refactor (sgl-project#21760)

Signed-off-by: Shangming Cai <csmthu@gmail.com>

* Fix remote weight info nnode>1 and dp>1 (sgl-project#17389)

* [diffusion] UX: replace deprecated ORJSONResponse with orjson_response (sgl-project#21755)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* [diffusion] fix: fix Wan2.2-I2V-A14B video max size issue(sgl-project#21390)

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: Mick <mickjagger19@icloud.com>

* [HiMambaTree]: Optimize mamba host lock mechanism (sgl-project#21750)

* [AMD] Fix Handle missing rope_theta in get_rope_config for Grok-1 (sgl-project#21518)

* [bugfix] Fix rope theta config for MiniMax after transformers v5 update (sgl-project#21241)

* Fix ineffective is_base_mistral CI patch for HF API rate limiting (sgl-project#21729)

* [2/n] lora - Shared outer experts and support qwen3_30b_a3b_instruct (sgl-project#21466)

Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>

* Fix cuda graph max bs capture upper bound (sgl-project#21005)

* [Fix] Fall back to triton MOE for GPT-OSS on Blackwell with driver >= 595 (sgl-project#21780)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Cache nvidia wheels locally to skip repeated 830 MB downloads in CI (sgl-project#21778)

* Add Trivy vulnerability scanning to nightly dev Docker builds (sgl-project#21772)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [CI] Remove more redundant PCG tests (sgl-project#21554)

* [moe] add customized option to moe-a2a-backend (sgl-project#21786)

* Add CompletionSampler for non-chat eval in run_eval (sgl-project#21785)

* Remove redundant test_moe_eval_accuracy_large (sgl-project#21787)

* Increase hicache eval to 200 examples (sgl-project#21791)

* Switch MooncakeSpec to EAGLE3 + Llama-3.1 (sgl-project#21794)

* Reduce redundant speculative decoding CI tests (sgl-project#21779)

* Fix killall.py crash when sglang is not yet installed (sgl-project#21797)

* Remove obsolete sgl-kernel legacy paths (sgl-project#21528)

* [jit_kernel] Optimize fused_qknorm_rope: deduplicate sincosf for interleave RoPE  (sgl-project#21654)

* CUTLASS NVFP4 GEMM improvement of SM120 (sgl-project#21314)

* [gRPC] Preserve original ImportError in grpc_server.py (sgl-project#21801)

Signed-off-by: Chang Su <chang.s.su@oracle.com>

* [Misc] Tiny: Add test network timeouts and dynamic max-parallel for 5090/2-gpu runners (sgl-project#21800)

* Fix draft extend cuda graph when spec_step=1 (sgl-project#21709)

* [Diffusion] Add `--uvicorn-access-log-exclude-prefixes` to suppress noisy access logs (sgl-project#20379)

* Add latency and throughput metrics to run_eval (sgl-project#21793)

* [diffusion] CI: improve ci reliability (sgl-project#21763)

* [bugfix]GLM-4V model (sgl-project#17122)

* Fix CVEs in Docker image: pillow, linux-libc-dev, and broken sgl-model-gateway build (sgl-project#21789)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: only showing recent runners from ci failure analysis (sgl-project#21015)

* [MPS] Fix Triton stub sub-module imports on Python 3.12+ (sgl-project#21551)

Co-authored-by: karanb192 <karan@example.com>
Co-authored-by: R0CKSTAR <yeahdongcn@gmail.com>
Co-authored-by: R0CKSTAR <xiaodong.ye@mthreads.com>

* [KDA] Fuse scaled_dot_kkt + solve_tril + recompute_w_u for KDA (sgl-project#21604)

Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>

* chore: bump flashinfer version to 0.6.7 (sgl-project#21422)

Co-authored-by: sglang-bot <sglang-bot@users.noreply.github.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>

* [3/n] lora moe - Support Qwen3-VL-30B-A3B-Instruct  (sgl-project#21469)

Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>

* [Feature Restoration] repetition_penalty is essential for GLM-V models (sgl-project#21258)

Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
Co-authored-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: hnyls2002 <lsyincs@gmail.com>
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>

* VLM: change default mm-attention backend from triton_attn to fa4 (on blackwell) (sgl-project#21595)

* Fix added tokens config with sensible filter (sgl-project#17905)

* [AMD] Optimize Qwen3-VL decode - fuse QK-norm + 3D mRoPE + KV cache write (sgl-project#21458)

Co-authored-by: Bingxu Chen <bingxche@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>

* [Bugfix] Fix PP tied embeddings weight loading for qwen3.5 4B dense model (sgl-project#21347)

* [CI] Fix lint that was not applied in sgl-project#21458 (sgl-project#21818)

* Bug fix for llama eagle3 (sgl-project#21397)

* glm_interleave for GLM-V (sgl-project#21671)

* style refinement for hisparse (sgl-project#21198)

* [Bug][VLM] Fix shared memory race condition in ShmPointerMMData broadcast for multi-GPU VLM serving (sgl-project#21655)

* [Bugfix] Fix effective_mamba_size over-allocation (sgl-project#20858)

Co-authored-by: Shangming Cai <csmthu@gmail.com>

* Fix in-place mode in pause generation (sgl-project#21705)

* [diffusion] fix: respect --prompt-path (sgl-project#21756)

* [NPU] update ascend docs (sgl-project#21807)

* [VLM] remove AsyncMMDataProcessor wrapper (sgl-project#21651)

* Use CustomTestCase for TestSessionControl to enable CI retry (sgl-project#21830)

* [NPU]Add a full test pipeline on NPU, resolve issues in the NPU test architecture (sgl-project#20751)

* [diffusion][CI]: Add individual component accuracy CI for diffusion models (sgl-project#18709)

Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>

* [Feature] JIT rmsnorm update (with claude) (sgl-project#21834)

* [Diffusion][NPU] add ring sp performance benchmark page in npu (sgl-project#21811)

* fix(MiMo-V2-Flash): add mimo reasoning parser (sgl-project#21414)

* [diffusion] hardware: support FA3 attention backend on MUSA (attn backend, 14/N) (sgl-project#18648)

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: Mick <mickjagger19@icloud.com>

* fix: pre-init tokenizer_manager to avoid AttributeError in shutdown (sgl-project#21824)

* [FlashInver v0.6.7] Integrate flashinfer_trtllm mxfp8 gemm (sgl-project#21576)

* [Misc] Add network timeout to eval dataset downloads (sgl-project#21873)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [refactor] Clean up duplicate flashinfer trtllm moe code (sgl-project#21233)

* [DSA] Support trtllm sparse mla kernel for prefill batches  (sgl-project#21783)

* [Disagg] GPU staging buffer with dynamic ring allocator for heterogeneous TP KV transfer (sgl-project#19890)

* Add merge prohibition policy during CI maintenance mode (sgl-project#21882)

* [Misc] Fix comparator e2e tests: add polars dep + fix dp-attention test (sgl-project#21804)

Co-authored-by: Alison Shao <alison.shao@mac.lan>

* revert: remove TTL-based hard pin from HiRadixCache (sgl-project#21884)

* Unify GSM8K eval path to Chat API for regression CI readiness (sgl-project#21667)

* [HiCache] fix: Clone host indices to avoid memory leak (sgl-project#21624)

Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>

* [HiCache & PD]Fixed detailed cache hit breakdown in PD scenarios. (sgl-project#21764)

* [CI] Add Llama 3.1 8B Instruct FP4 CI test on SM120 (sgl-project#20648)

* [CI] Add Per-Tensor, Blockwise FP8 Tests on SM120 (sgl-project#20717)

Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>

* Allow /rerun-test to checkout fork PR branch for trusted users (sgl-project#21890)

* Direct model loading from object storage with Runai Model Streamer (sgl-project#17948)

Signed-off-by: Noa Neria <noa@run.ai>

* fix pcg torch dynamo recompile in mxfp8 Triton path (sgl-project#21888)

Co-authored-by: Hanlin Bi <hanlinbi@umich.edu>

* chore: bump mooncake version to 0.3.10.post1 (sgl-project#21844)

* [VLM] Add VLM TP=4 per-commit CI test and improve MMMU eval prompt/parser (sgl-project#21841)

* fix(ci): update est_time for 57 tests based on runtime analysis (sgl-project#21896)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [CI] Increase multimodal server test timeout from 60 to 90 minutes (sgl-project#21897)

* [CI] Remove crashing Kimi K2.5 EAGLE3/MTP variants, keep TP8 and TP8+DP8 (sgl-project#21898)

* [diffusion] CI: add initial nvfp4 ci test for b200 (sgl-project#21767)

Co-authored-by: Mick <mickjagger19@icloud.com>

* Migrate all callers from /get_server_info to /server_info (sgl-project#21463)

* Support PP key for file backend (sgl-project#21901)

* Enable multi-thread weight loading by default (sgl-project#20289)

* Skip Go stdlib and NVIDIA tool CVEs in Trivy scan (sgl-project#21905)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [Kernel] Fuse temperature + softmax in sampling for decode speedup (sgl-project#20501)

* Multi tool streaming fix (sgl-project#20004)

* Return HTTP 400 for streaming validation errors (sgl-project#21900)

* [Spec][Ngram] 4/N: Remove `max_match_window_size` and `min_match_window_size`, matching all suffixes of the Trie (sgl-project#21225)

* Fix ngram doc for speculative_num_draft_tokens default (sgl-project#21910)

* [NVIDIA] Enable fp8 flashinfer_trtllm_routed MoE for MiniMax-M2.5 (sgl-project#20394)

* scheduler: add prefill-only update in merge batch (sgl-project#21840)

* [DSA] Set trtllm kernels as nsa default for Blackwell (sgl-project#21914)

* Revert "Rollback flashmla to older version [1/2]" (sgl-project#21922)

* test: add manual init test for mooncake transfer engine (sgl-project#21842)

Co-authored-by: yunzhi <ningyunxiao.nyx@antgroup.com>

* Fix spec v2 + logprob when max_num_token is set (sgl-project#20799)

* Migrate ngram corpus from torch cpp_extension to TVM FFI jit_kernel (sgl-project#21920)

Co-authored-by: DarkSharpness <2040703891@qq.com>

* [NPU] Support  GLM-4.7-Flash on NPU (sgl-project#21408)

* [CI] Fix gpu deps import in cpu test (sgl-project#21950)

* [Parallel State Refactor 1/n] Remove stream of PyNCCL (sgl-project#20866)

* [diffusion] chore: fix stage profiler for multi-stage denoising (sgl-project#21955)

* [CI] [Tracing] Add ci for tracing and fix bugs (sgl-project#21740)

* Remove logging for subprocess watchdog start (sgl-project#21968)

* [4/n] Support gpt oss 20b lora (sgl-project#21570)

* [MUSA][9/N] Add FA3 attention backend support through MATE (MUSA AI Tensor Engine) (sgl-project#17985)

Co-authored-by: R0CKSTAR <xiaodong.ye@mthreads.com>

* [Feature] Stronger transformers modeling backend with TP, PP, MoE, VLMs, and torch compile (sgl-project#19163)

* [CI] Remove stale Ascend suite entries from test/srt/run_suite.py (sgl-project#21978)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Skip broken AutoModel mapping entries when resolving Llava submodules (sgl-project#21892)

* [CI] Add timeouts to Slack upload urlopen and WebClient (sgl-project#21903)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [Diffusion][NPU] Add support for MOVA (sgl-project#21633)

Co-authored-by: zhangshuai (S) <z00836796@china.huawei.com>

* Remove maxItems=1 restriction when tool_choice is specified (sgl-project#20208)

* [Feature] NVFP4 Marlin fallback for non-Blackwell GPUs (SM75+) (sgl-project#19652)

* [PP] qwen3 vl skip layer id for pp (sgl-project#19135)

* [VLM] Enable per-image MM splitting by default and remove MULTI_IMAGES modality (sgl-project#21899)

* [Bugfix] Fix incorrect dp-attention parallel info in bench_one_batch (sgl-project#21519)

* Revert "[MUSA][9/N] Add FA3 attention backend support through MATE (MUSA AI Tensor Engine)" (sgl-project#22002)

* [NPU] Optimized the wording in the npu docs (sgl-project#21998)

* [Parallel State Refactor 2/n] Unify code path of AMD deterministic all reduce (sgl-project#20871)

* [AMD] Resolve the performance degression when launch server with "--enable-aiter-allreduce-fusion" (sgl-project#21947)

Co-authored-by: wunhuang <wunhuang@amd.com>

* chore: bump sgl-kernel version to 0.4.1 (sgl-project#21447)

Co-authored-by: sglang-bot <sglang-bot@users.noreply.github.com>

* [Workflow] Avoid triggering nightly tests in kernel bump workflow (sgl-project#22010)

* [Workflow] Fix kernel release jobs skipped on push events (sgl-project#22011)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [PD]: Add support for HiSparse to directly transfer the cache from Prefill to Decode DRAM. (sgl-project#21591)

Co-authored-by: Tingwei Huang <huangtingwei9988@gmail.com>
Co-authored-by: Shangming Cai <csmthu@gmail.com>
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>

* [Misc] Update CI permission (sgl-project#22014)

* [ROCM][RL] Shuffle Weight In-Place to Preserve Parameter Attributes (sgl-project#21825)

* [CI] Fix duplicate job names that bypass branch protection (sgl-project#22001)

* fix: remove duplicate words in comments (sgl-project#22007)

* [PD] Tiny register info field cleanup for mooncake backend (sgl-project#22016)

* [NPU] optimize glm4.7 (sgl-project#19246)

* [AMD] Enable FP8 KV cache and FP8 attention kernel for NSA on MI300/MI355 with TileLang backend (sgl-project#21511)

* [AMD] Add MiniMax-M2.5 nightly perf benchmarks for MI30x and MI35x (sgl-project#21524)

---------

Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com>
Signed-off-by: P V R K Jyothendra Varma <polisetty.v.r.k.jyothendra.varma@intel.com>
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Signed-off-by: Shangming Cai <csmthu@gmail.com>
Signed-off-by: Chang Su <chang.s.su@oracle.com>
Signed-off-by: Noa Neria <noa@run.ai>
Co-authored-by: Bingxu Chen <bingxche@amd.com>
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
Co-authored-by: yang1002378395-cmyk <yang1002378395@gmail.com>
Co-authored-by: Mick <mickjagger19@icloud.com>
Co-authored-by: Bi Xue <bi@thinkingmachines.ai>
Co-authored-by: huangtingwei <141888744+huangtingwei9988@users.noreply.github.com>
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
Co-authored-by: Muqi Li <muqi1029@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Qiaolin Yu <liin1211@outlook.com>
Co-authored-by: narutolhy <582909902@qq.com>
Co-authored-by: Ethan (Yusheng) Su <yushengsu.thu@gmail.com>
Co-authored-by: zhangxiaolei <zhangxiaolei.666@bytedance.com>
Co-authored-by: Vladislav Nosivskoy <vladnosiv@gmail.com>
Co-authored-by: Trevor Morris <tmorris@nvidia.com>
Co-authored-by: Eitan Turok <150733043+eitanturok@users.noreply.github.com>
Co-authored-by: Fengyuan Yu <Yuandao151112@163.com>
Co-authored-by: Fengyuan Yu <15fengyuan@gmail.com>
Co-authored-by: Adarsh Shirawalmath <114558126+adarshxs@users.noreply.github.com>
Co-authored-by: Yuhao Yang <47235274+yhyang201@users.noreply.github.com>
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
Co-authored-by: Jianying <53503712+jianyingzhu@users.noreply.github.com>
Co-authored-by: jianyingzhu <joeyzhu@nvidia.com>
Co-authored-by: Kangyan-Zhou <zky314343421@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Jacob0226 <jacchang@amd.com>
Co-authored-by: Aditya Sharma <89210949+adityavaid@users.noreply.github.com>
Co-authored-by: Yuan Luo <yuan.luo@hotmail.com>
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
Co-authored-by: Артем Савкин <58187114+OrangeRedeng@users.noreply.github.com>
Co-authored-by: Tamir Baydasov <41994229+TamirBaydasov@users.noreply.github.com>
Co-authored-by: Shu Wang <shuw@nvidia.com>
Co-authored-by: eigen <52445717+yyihuang@users.noreply.github.com>
Co-authored-by: Avery Huang <averyh@nvidia.com>
Co-authored-by: jacky.cheng <yichiche@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
Co-authored-by: Shangming Cai <csmthu@gmail.com>
Co-authored-by: Junrong Lin <33685709+ocss884@users.noreply.github.com>
Co-authored-by: Simon (Jiyou) Li <Simon-Li@users.noreply.github.com>
Co-authored-by: 继优 <jiyou.ljy@alibaba-inc.com>
Co-authored-by: shuwenn <47200617+alphabetc1@users.noreply.github.com>
Co-authored-by: psaab <ps@meta.com>
Co-authored-by: hnyls2002 <lsyincs@gmail.com>
Co-authored-by: Hanlin Bi <52993433+wolfcomos@users.noreply.github.com>
Co-authored-by: wili <98001977+wili-65535@users.noreply.github.com>
Co-authored-by: saatwiknagpal <saatwiknagpal@gmail.com>
Co-authored-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com>
Co-authored-by: wan4ch <wan4ch@gmail.com>
Co-authored-by: Feng Su <sufeng@linux.alibaba.com>
Co-authored-by: Ying Sheng <sqy1415@gmail.com>
Co-authored-by: Polisetty V R K Jyothendra Varma <polisetty.v.r.k.jyothendra.varma@intel.com>
Co-authored-by: Ziang Li <ziangli@umich.edu>
Co-authored-by: Aishwarya Ramasethu <56765596+aramasethu@users.noreply.github.com>
Co-authored-by: Ma Mingfei <mingfei.ma@intel.com>
Co-authored-by: blzheng <beilei.zheng@intel.com>
Co-authored-by: kk <43161300+kkHuang-amd@users.noreply.github.com>
Co-authored-by: wunhuang <wunhuang@amd.com>
Co-authored-by: Michelle Wu <michellewu351@gmail.com>
Co-authored-by: wuxue (C) <w00964934@china.huawei.com>
Co-authored-by: Hubert Lu <55214931+hubertlu-tw@users.noreply.github.com>
Co-authored-by: strgrb <zhangkaihong.zkh@antgroup.com>
Co-authored-by: LiYomi <106872109+LiYomi@users.noreply.github.com>
Co-authored-by: mengxiancheng03 <mengxiancheng03@kuaishou.com>
Co-authored-by: GXIN <37653830+gxxx-hum@users.noreply.github.com>
Co-authored-by: 高鑫 <gaoxin@gaoxindeMacBook-Pro.local>
Co-authored-by: heziiop <q_m_p@qq.com>
Co-authored-by: xieminghe1 <141820649+xieminghe1@users.noreply.github.com>
Co-authored-by: undefined <zhouchen.arrebol@jd.com>
Co-authored-by: Makcum888e <79456407+Makcum888e@users.noreply.github.com>
Co-authored-by: yuefeng Wu <33725817+ChefWu551@users.noreply.github.com>
Co-authored-by: Yuxuan Zhang <2448370773@qq.com>
Co-authored-by: Vedant V Jhaveri <vedantjh2@gmail.com>
Co-authored-by: ronnie_zheng <zl19940307@163.com>
Co-authored-by: Zhai Feiyue <80079571+ZhaiFeiyue@users.noreply.github.com>
Co-authored-by: jhchouuu <jiahzhou@amd.com>
Co-authored-by: Alison Shao <54658187+alisonshao@users.noreply.github.com>
Co-authored-by: Alison Shao <alison.shao@MacBook-Pro-D2W773R9CD.local>
Co-authored-by: DarkSharpness <76582120+DarkSharpness@users.noreply.github.com>
Co-authored-by: Alison Shao <alison.shao@Mac.attlocal.net>
Co-authored-by: Lewis <63569348+TTThanos@users.noreply.github.com>
Co-authored-by: 百麒 <yaozhong.lyz@alibaba-inc.com>
Co-authored-by: Jincong Chen <jincong.cjc@ant-intl.com>
Co-authored-by: xiazhahe <86939755+xiazhahe@users.noreply.github.com>
Co-authored-by: Thomas Wang <thomawan@amd.com>
Co-authored-by: Ke Bao <ispobaoke@gmail.com>
Co-authored-by: xiaoqi <xq25478@qq.com>
Co-authored-by: xiaoqi.31 <xiaoqi.31@jd.com>
Co-authored-by: R0CKSTAR <xiaodong.ye@mthreads.com>
Co-authored-by: weireweire <weiliangl@nvidia.com>
Co-authored-by: Weiliangl User <weiliangl@login-node.hosted.internal>
Co-authored-by: JD <jaedon.guo@gmail.com>
Co-authored-by: Zhangheng <hzh0425@apache.org>
Co-authored-by: Michael <13900043+michaelzhang-ai@users.noreply.github.com>
Co-authored-by: Yilong Zhao <74357408+happierpig@users.noreply.github.com>
Co-authored-by: Johnsonms <lizhaofu@gmail.com>
Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>
Co-authored-by: Chang Su <chang.s.su@oracle.com>
Co-authored-by: KnightLTC <56717110+KnightLTC@users.noreply.github.com>
Co-authored-by: Douglas Yang <dyang@college.harvard.edu>
Co-authored-by: Karan Bansal <karanb192@users.noreply.github.com>
Co-authored-by: karanb192 <karan@example.com>
Co-authored-by: R0CKSTAR <yeahdongcn@gmail.com>
Co-authored-by: sglang-bot <sglangbot@gmail.com>
Co-authored-by: sglang-bot <sglang-bot@users.noreply.github.com>
Co-authored-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: sbeurnier <sbeurnier@together.ai>
Co-authored-by: YC Yen-Ching Tseng <yctseng@amd.com>
Co-authored-by: Wenyao Gao <105094497+edwingao28@users.noreply.github.com>
Co-authored-by: Alex Nails <alex.nails@radixark.ai>
Co-authored-by: khalilzhk <khalilzhk@gmail.com>
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
Co-authored-by: yudian0504 <138860534+yudian0504@users.noreply.github.com>
Co-authored-by: yunkchen <chenyunkuo.cyk@alibaba-inc.com>
Co-authored-by: wduan-hai <wduan@humansand.ai>
Co-authored-by: amote-i <49533125+amote-i@users.noreply.github.com>
Co-authored-by: Cherry_ming <136634645@qq.com>
Co-authored-by: Ratish P <114130421+Ratish1@users.noreply.github.com>
Co-authored-by: YAMY <74099316+YAMY1234@users.noreply.github.com>
Co-authored-by: Alison Shao <alison.shao@mac.lan>
Co-authored-by: ishandhanani <82981111+ishandhanani@users.noreply.github.com>
Co-authored-by: Derek Yu <81697272+DerekY2@users.noreply.github.com>
Co-authored-by: Noa Neria <noa@run.ai>
Co-authored-by: Hanlin Bi <hanlinbi@umich.edu>
Co-authored-by: Prozac614 <dwt614707404@163.com>
Co-authored-by: David Cheung <d7cheung@gmail.com>
Co-authored-by: Mook <68294499+Godmook@users.noreply.github.com>
Co-authored-by: Khoa Pham <khoa.pham@radixark.ai>
Co-authored-by: foraxe <73625538+foraxe@users.noreply.github.com>
Co-authored-by: yunzhi <ningyunxiao.nyx@antgroup.com>
Co-authored-by: DarkSharpness <2040703891@qq.com>
Co-authored-by: Todobe <43903496+Todobe@users.noreply.github.com>
Co-authored-by: ori <39351881+froststeam@users.noreply.github.com>
Co-authored-by: Thomas <zs033@qq.com>
Co-authored-by: zhangshuai (S) <z00836796@china.huawei.com>
Co-authored-by: lviy <142899752+lviy@users.noreply.github.com>
Co-authored-by: Tingwei Huang <huangtingwei9988@gmail.com>
Co-authored-by: Yuzhen Zhou <82826991+zyzshishui@users.noreply.github.com>
Co-authored-by: Ricardo-M-L <69202550+Ricardo-M-L@users.noreply.github.com>
Co-authored-by: Kelon <kelonlu@163.com>
Co-authored-by: cen121212 <luochen23@huawei.com>